1 Introduction

It is September of 2020, right in the middle of the COVID-19 pandemic and during the California statewide shutdown. You have just been hired by the California Department of Public Health (CDPH) in Vital Records to track non-COVID mortality trends for the past 4 years, for which all data have already been aggregated and de-identified (anonymized). You are told that the purpose is two-fold:

  1. Help the state identify shifts in spatiotemporal mortality trends as a result of the COVID statewide stay-at-home orders.

  2. Help the state build predictive models for non-COVID causes of mortality to establish a pre-COVID baseline.

1.1 Provided data: California mortalities

You were given properly de-identified data from 2017 - 2020, which contain mortality records for the state of California by gender, leading cause of death, race / ethnicity, and location of death. All data have been taken from the death certificates filed with the state.

NOTE: If you are interested in these data, the source is the California Death Profiles dataset maintained by California Health and Human Services.

## Load the provided data files
load("california_mortality_tables.RData")

You are provided 6 tables total, which have the following variables.

Table 1. Data dictionary of variables in the California mortality tables.

All Tables
  Year: Year of record
  Month: Month of record
  County: California county where death was recorded

Annual Mortality Table
  annualCount: Annual number of deaths
  raceEthnicity: Race and/or ethnicity of the deceased; Latinx is unfortunately treated as a racial category here rather than an ethnic group

Specialized Variables
  leadingCauseDeath: Leading cause of death per death certificate; see separate table for encoding
  ageBins: Age groups where the age at death was known
  ageDeaths: Number of deaths for the age bin specified
  biologicalSex: Biological sex of decedent
  sexDeaths: Number of deaths for the sex specified
  placeDeathOccurred: Location of death, per death certificate
  placeDeaths: Number of deaths matching specified criteria
  raceDeaths: Number of deaths matching race / ethnicity criteria
  totalDeaths: Number of deaths per month, year, & county
Table 2. Underlying cause of death category, based on coding using the Tenth Revision of the International Classification of Diseases (ICD-10).
  HTD: Diseases of the Heart (I00-I09, I11, I13, I20-I51)
  CAN: Malignant Neoplasms (Cancers) (C00-C97)
  STK: Cerebrovascular Disease (Stroke) (I60-I69)
  CLD: Chronic Lower Respiratory Disease (CLRD) (J40-J47)
  INJ: Unintentional Injuries (V01-X59, Y85-Y86)
  PNF: Pneumonia and Influenza (J09-J18)
  DIA: Diabetes Mellitus (E10-E14)
  ALZ: Alzheimer’s Disease (G30)
  LIV: Chronic Liver Disease and Cirrhosis (K70, K73-K74)
  SUI: Intentional Self Harm (Suicide) (U03, X60-X84, Y87.0)
  HYP: Essential Hypertension and Hypertensive Renal Disease (I10, I12, I15)
  HOM: Homicide (U01-U02, X85-Y09, Y87.1)
  NEP: Nephritis, Nephrotic Syndrome and Nephrosis (N00-N07, N17-N19, N25-N27)
  OTH: All Other Causes of Death (all ICD-10 codes not listed above)

NOTE: Make sure to refer to these tables as you go!
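If you ever want the human-readable labels from Table 2 in code, one lightweight approach is a named vector lookup. This is just a sketch; causeLabels is a hypothetical helper, not part of the provided tables.

```r
## Hypothetical helper: map a few of the 3-letter codes from Table 2 to
## their descriptions using a named character vector
causeLabels <- c(HTD = "Diseases of the Heart",
                 CAN = "Malignant Neoplasms (Cancers)",
                 STK = "Cerebrovascular Disease (Stroke)")

## Name-based indexing recodes a vector of codes in one step
unname(causeLabels[c("CAN", "HTD")])
## [1] "Malignant Neoplasms (Cancers)" "Diseases of the Heart"
```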

3 Project future mortality

Your stakeholders have asked if you can quickly give them any idea of what to expect mortality to be across the counties for the remaining months of 2020. Now, even though you know it is not the most robust method of time series projection available (a caveat!), you immediately think to yourself, “Oh, I could do a regression.” So, you do.

3.1 Regression predictions of mortality in 2020

First, you need to do a little prep to the data to get you started. You need to calculate the monthly percent mortality and attach it to your mortality shapefile, mortality_sf2.

## First, calculate monthly percent mortality in your dataframe:
mortality_sf2 <- mortality_sf2 %>% 
  ## Group by County as well as Year
  group_by(COUNTY_NAM, Year, Month) %>% 
  ## Summarize by calculating a monthly percent mortality
  ## Note that I have to divide by the mean of the N from the Census  
  ## or I'd get an error
  summarize(monthlyPercMortality = sum(totalDeaths, 
                                       na.rm = TRUE)/mean(totalN)*100) %>% 
  ## Ungroup - necessary to remove the grouping for future use
  ungroup()

Now, fit the OLS linear regression.

olsModel <- lm(monthlyPercMortality ~ COUNTY_NAM + Year + Month, data = mortality_sf2)

Next, make some new data (stored into newdata) so you can generate predictions for those new data.

## Make the new data:
## Get the names of the counties; repeat four times for the 4 months
COUNTY_NAM <- rep(unique(mortality_sf2$COUNTY_NAM), 4)
## Make a set of months for each of the 58 counties for Sept, Oct, Nov, Dec
Month <- c(rep(9, 58),
            rep(10, 58),
            rep(11, 58),
            rep(12, 58))
Year <- rep(2020, length(Month))
## Put this into a new dataframe called "newdata"
newdata <- cbind.data.frame(COUNTY_NAM, Year, Month)
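An equivalent way to build the same prediction grid, shown here as a sketch (newdataAlt is a hypothetical name), is expand.grid(), which crosses every county with every month in one call:

```r
## Equivalent sketch: expand.grid() crosses every county with every month,
## giving the same 58 counties x 4 months = 232 rows as the rep() approach
newdataAlt <- expand.grid(COUNTY_NAM = unique(mortality_sf2$COUNTY_NAM),
                          Month      = 9:12,
                          stringsAsFactors = FALSE)
newdataAlt$Year <- 2020
dim(newdataAlt)  ## should be 232 rows by 3 columns
```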

Now, generate the predictions using newdata and attach the predicted monthly percent mortality to newdata; this is necessary for plotting later. Make sure you check the dimensions of newdata: since you are missing four months of data (September - December) in 2020, there should be a row for every county for every missing month.

## Attach it to the dataframe and then call it the same name as the original Y
newdata$percMortality <- predict(olsModel, newdata = newdata)
## Also, add a flag so that you know it is predicted
newdata$isPredicted <- "1"

Finally, you are ready to make a temporary dataframe, which you will call temp, which will pull everything together - the original data and the predictions - so that you can add them to your line graphs from above. Although I could choose to add the predictions to your mortality shapefile instead, I am choosing not to in order to keep that more pristine. I don’t want to add predictions that I may not use in the future to my good dataframe!

temp <- totalDeaths %>%   ## Start from totalDeaths for simplicity
  ## Join with census by County
  full_join(census, by = "County") %>%
  ## Group by Year and then Month to aggregate the counties
  group_by(Year, Month) %>% 
  ## Calculate the percent mortality - multiplying by 100 makes it a percent
  summarise(percMortality = sum(totalDeaths, na.rm = TRUE) / 
              sum(totalN, na.rm = TRUE)*100) %>%  
  ## Keep only the columns you need
  select(Year, Month, percMortality) %>% 
  ## Make an empty flag column filled with zeros to flag if the data are 
  ## predicted
  mutate(isPredicted = "0")

## Clean up newdata a little: you need to aggregate across the counties by year and month
newdata <- newdata %>% 
  ## Group by Year and then Month to aggregate the counties; add isPredicted
  ## in there too so that it makes it into the final dataset!
  group_by(Year, Month, isPredicted) %>% 
  ## Remove what you don't need
  select(-COUNTY_NAM) %>% 
  ## Calculate a mean percent mortality
  summarize(percMortality = mean(percMortality, na.rm = TRUE))

## Row bind newdata onto temp to give you a complete, ordered, dataset for plots!
temp <- rbind(temp, newdata)
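One detail worth knowing here: for data frames, rbind() matches columns by name, so it does not matter that temp and newdata list their columns in different orders. A tiny standalone demonstration (toy data, not the mortality tables):

```r
## rbind() on data frames matches columns by name, not by position
a <- data.frame(x = 1, y = 2)
b <- data.frame(y = 20, x = 10)   ## same columns, different order
rbind(a, b)
## combined correctly: x = 1, 10 and y = 2, 20
```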

Lastly, finish out the year of 2020 with predictions (shown in bubble points):

Question 16 [1 point]:

Do you think this regression prediction is robust? Why or why not? (Hint: Check your assumptions with plot(olsModel)!!)

# Your code here.

Your answer here.

3.2 Spatial lag regression

Then you recall that class you took during your Master’s degree - what was it called again? Doesn’t matter! You remember that calculating spatial weights would allow you to do a type of spatial regression, such as spatial lag regression, because it deals with spatial autocorrelation. Right, that was the term!

3.2.1 Calculate spatial weights matrix

The first step in calculating a spatial weights matrix is to ensure geometry is retained in the shapefile, even if you plan to aggregate by month or by year for mapping. I actually had you covered earlier when you first calculated the monthly percent mortality, monthlyPercMortality, in the mortality shapefile, mortality_sf2. By ungrouping at the end of the calculation, I ensured that you lost no data or geometry! This is also exactly WHY you did not do any aggregation directly on these data or attach the subpar predictions to mortality_sf2.

Warning: Careless aggregation WILL rob you of spatial or other information, which will make spatial modeling impossible. This is because, in order to run spatial models, the spatial object must retain geometry.

Now that you know you have retained all of your spatial information in mortality_sf2, you can create the spatial weights matrix similarly to how you did it in Demo 2 using spdep.

  1. Create the neighbor list using Queen contiguity. This could take a minute to run, so be patient.

This identifies which counties touch each other (using poly2nb()). Note that, for this step, you are only looking at the unique pairs of county names and geometries.

mortality_nb <- poly2nb(mortality_sf2 %>% 
                          ## Isolate the unique county name-geometry pairs
                          select(COUNTY_NAM, geometry) %>% unique() %>% 
                          ## Arrange the county names alphabetically (helps
                          ## ensure no downstream issues)
                          arrange(COUNTY_NAM) %>% 
                          ## Turns the county name into a factor (helps ensure
                          ## no downstream issues)
                          mutate(COUNTY_NAM = as_factor(COUNTY_NAM)))
  

## Make sure the size matches the number of counties:
length(mortality_nb)
## [1] 58

Question 17A [0.5 points]

Why am I calculating the neighbors on just the distinct counties in the mortality shapefile when I just finished saying you don’t want to lose any of your spatial data for the model?

Hint 1: What are the dimensions of mortality_sf2 and is it a multiple of 58?

Hint 2: What happens if I calculate spatial weights on redundant data? Does it cost me computationally?

Your answer here.

  2. Calculate the row-standardized spatial weights matrix based on the neighbors list. This converts the neighbor information stored in mortality_nb into a usable weights list using nb2listw(). Neighbors have equal influence (row-standardized: style = "W").
mortality_sw <- nb2listw(mortality_nb, style = "W")

3.2.2 Join spatial weights to the mortality shapefile so that every datapoint has its spatial weight.

You want to calculate, for each county in each year and month, the average percent mortality of its neighboring counties (i.e., the spatial lag). This helps you model whether mortality in one county is influenced by nearby counties. Recall from lecture and Demo 2 that a spatial lag is the weighted average of a variable (here, mortality) in neighboring counties. For example, if Los Angeles county is surrounded by five counties, the spatial lag for L.A. county is the average mortality rate across those five neighbors, adjusted by the spatial weights.
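To make the “weighted average of neighbors” idea concrete, here is a toy three-county example (hypothetical numbers, plain base R rather than spdep) showing that a row-standardized weights matrix times the values vector yields exactly the neighbor means:

```r
## Toy example: county A neighbors B and C; B and C each neighbor only A.
## Row-standardized weights: each row sums to 1.
W <- rbind(c(0, 0.5, 0.5),   ## A: half weight to B, half to C
           c(1, 0,   0  ),   ## B: all weight to A
           c(1, 0,   0  ))   ## C: all weight to A
vals <- c(1, 4, 6)           ## mortality values for A, B, C

## The spatial lag is W %*% vals: the mean of each county's neighbors
as.numeric(W %*% vals)
## [1] 5 1 1
## A's lag is mean(4, 6) = 5; B's and C's lag is A's value, 1
```

This is the same arithmetic that spdep::lag.listw() performs with style = "W" weights.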

To accomplish this, you need to loop over the time components (year and month) and compute the spatial lag within each year-month combination. You will use dplyr’s grouped mutate() to do this; it is more concise, and typically faster, than writing an explicit for() loop. As it iterates over the dataset, one year-month slice at a time, it will use spdep::lag.listw() to calculate the spatial lag for each time point. The result is a new column in mortality_sf2 that contains the average monthly mortality of any given county’s neighbors for every county-month-year combination. Don’t forget! You must center (not scale) the monthly percent mortality before computing the spatial lag.

## Make a centered transformation of monthly percent mortality
## (as.numeric() drops the one-column matrix that scale() returns)
mortality_sf2$monthlyPercMortalityCentered <- as.numeric(scale(mortality_sf2$monthlyPercMortality,
                                                               center = TRUE, scale = FALSE))

## Initialize an empty vector to store spatial lag values
mortality_sf2$lagged_monthlyPercMortality <- NA_real_

## Calculate spatial weights for every time point
mortality_sf2 <- mortality_sf2 %>%
  arrange(COUNTY_NAM, Year, Month) %>%  ## Ensures it is in same order as weights
  ## Handles each time slice separately
  group_by(Year, Month) %>%
  mutate(
    lagged_monthlyPercMortality = {
      ## Slice to get the current group's data (i.e., one month-year combo)
      ## and then sort within group to match spatial weights
      monYr <- arrange(cur_data(), COUNTY_NAM)$monthlyPercMortalityCentered
      ## Compute the spatial lag using the spatial weights in mortality_sw
      lag.listw(mortality_sw, monYr)
    }
  ) %>%
  ungroup()

3.2.3 Run spatial lag regression using the lagged monthly mortality as a covariate

In Demo 2, you used the spatialreg package’s lagsarlm() function to perform a spatial lag regression. However, because you needed to compute the spatial lag within each year-month combination for each county, you have to perform your spatial lag regression a little differently: using just the lm() function and explicitly including the spatially lagged monthly percent mortality yourself.

Question 17B [0.25 points]

Run a spatial lag regression (SAR) model using the spatial lag you just calculated. Save the model as an object called sarModel. I have gotten it started for you so you use the spatial lag correctly, but you need to fill in the blanks, run the analysis, and assess the results.

Hint 1: Your target (dependent variable) is unchanged from OLS regression.

Hint 2: Make sure to include both time elements as covariates!

# __ <- lm(__ ~ __ + __ + lagged_monthlyPercMortality, 
#     data = __)

Now, let’s compare sarModel to olsModel and see which model better explains monthly percent mortality in California (uncomment to run).

# anova(sarModel, olsModel)

Question 17C [0.25 points]

Which model, the spatial lag regression (sarModel) or the ordinary least squares regression (olsModel), is a better fit to the data? Why?

Your answer here.

Question 17D [0.5 points]

Using the model you chose as the better model, sarModel or olsModel, interpret the results of that model. HINT: Remember that summary() helps you access the results from the model!

Make sure to interpret both the coefficients and their significance as well as the \(R^2\)!

## Your code here.

Your answer here.

Question 17E [0.25 points]

Can you trust the results of the model? Why or why not?

## Your code here.

Your answer here.

4 Improve Predictions with Machine Learning

You are growing increasingly concerned that you will not be able to provide robust predictions to your stakeholder, and you’re feeling nervous - it is a new job, after all! So, you think about other ways to try to improve the prediction. Although you technically can’t see into the future, right now, you and I can - which is exactly what you are going to do! Unlike the random splits you have done before, you are actually going to perform what is called a time-aware split, where you use January 2017 to December 2019 as your training data and January to August 2020 as your testing data. This approach accounts for temporal structure in the data, as well as spatial structure in the data by using the lagged mortality you computed above as a spatial proxy.

Perform a time-aware train-test split

## Make the training data
mortalityTrain <- mortality_sf2 %>% 
  ## Drop the geometry to make processing downstream faster
  st_drop_geometry() %>% 
  ## Filter for just the year-month combos you want
  filter(Year != 2020) %>% 
  ## Drop the previously centered monthly percent mortality from when you made
  ## the spatial weights matrix
  select(-monthlyPercMortalityCentered)

## Make the testing data
mortalityTest <- mortality_sf2 %>% 
  ## Drop the geometry to make processing downstream faster
  st_drop_geometry() %>% 
  ## Filter for just the year-month combos you want
  filter(Year == 2020) %>% 
  ## Drop the previously centered monthly percent mortality from when you made
  ## the spatial weights matrix
  select(-monthlyPercMortalityCentered)

Question 18A [0.5 points]

What percent of the data are in the training set and what percent in the testing set? You should use code to figure this out! How does it compare to the canonical 80-20 split?

## Your code here.

Your answer here.

Question 18B [0.5 points]

What caveats or limitations can you think of by using none of the 2020 data in the training set?

Your answer here.

4.1 Fitting an out-of-bag (OOB) random forest (RF) model

You decide to use random forest (RF) for this task because it handles non-linearity, which you know you have from your time plots above AND from the summaries of both your OLS and SAR regressions. You also choose to use RF because it can be more robust to outliers and still give some idea of importance for interpretability for your stakeholders. Your basic workflow will be to:

  1. Center and scale the predictors

  2. Fit the model with OOB error

  3. Fit the model with 10-fold cross-validation

  4. Compare the RMSE of the two models

  5. Show the variable importance of the final model

You decide to leverage the caret package in R again for much of your model fitting and tuning needs. If you want to know more about what caret does and doesn’t do, this chapter can help.

Question 19A [0.5 points]

What does it mean to “fit a model with out-of-bag (OOB) error”? Why do we do it? What can it tell you here?

Your answer here.

Even though you have no categorical encoding to do, you know that you need to center and scale your data before running the random forest! So, you decide to leverage the preProcess() function in the caret package to help you center and scale the predictors. You perform the preprocessing explicitly here, storing the pre-processing instructions in preProcInstructions. Then, you apply the pre-processing to the training data, mortalityTrain, using the predict() function. Note that this is a different step from predicting from a model for performance evaluation!

## First, make a new object, preProcInstructions, that stores the 
## pre-processing instructions from mortalityTrain:
preProcInstructions <- preProcess(mortalityTrain, 
                                    method = c("center", "scale"))

## Now apply the pre-processing instructions using predict()
mortalityTrain <- predict(preProcInstructions,
                                      newdata = mortalityTrain)
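If you want to convince yourself what this pair of calls does, here is a tiny check on toy data (assuming the caret package is loaded; toy and pp are hypothetical names): preProcess() only stores the training means and standard deviations, and predict() actually applies them.

```r
library(caret)

## Toy check: after center + scale, every numeric column has mean 0 and SD 1
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
pp  <- preProcess(toy, method = c("center", "scale"))
toyScaled <- predict(pp, newdata = toy)
round(colMeans(toyScaled), 10)  ## a and b are both 0
sapply(toyScaled, sd)           ## a and b are both 1
```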

Then, you fit the OOB RF model after using the trainControl() function to tell it to fit the training model with OOB error.

Fit an OOB random forest (RF) model

## Define training control for OOB
ctrlOOB <- trainControl(method = "oob")

## Define the tuning grid
tuneGrid <- expand.grid(mtry = 1:3)  ## Coerce it to use mtry 1 through 3

## Run the OOB random forest, including preprocessing instructions
rfOOB <- train(
  ## Define the model - everything but COUNTY_NAM
  monthlyPercMortality ~ Year + Month + lagged_monthlyPercMortality,
  ## Set the data
  data = mortalityTrain,
  ## Indicate that it is a random forest
  method = "rf",
  ## Indicate the training control method to use, defined above
  trControl = ctrlOOB,
  ## Indicate that you want to extract importance at the end
  importance = TRUE,
  ## Indicate the tuning grid, defined explicitly above
  tuneGrid = tuneGrid
)

Note that here you coerce it to use an mtry hyperparameter grid search of 1 through 3. This is just because you have a really non-complex dataset (only 3 predictors!) and you want to force it to run as if it had a little more complexity. Why? Just to make sure it searches long enough!

Question 19B [0.5 points]

What is mtry and what is it doing in the context of a random forest? What is the default mtry for a random forest regression? Why am I choosing to override it here, do you think?

Your answer here.

Look at the results (\(RMSE\) and \(R^2\)) of the OOB random forest

rfOOB$results %>% 
  kable(
    format = "html",
    caption = "Table 3. Results of the OOB random forest on California Mortality (All Causes).") %>%
    kable_styling(bootstrap_options = c("hover"))
Table 3. Results of the OOB random forest on California Mortality (All Causes).
RMSE Rsquared mtry
0.9832543 0.0327476 1
1.0301978 -0.0618160 2
1.0614164 -0.1271445 3

Recall from the Elastic Net you fitted a couple weeks ago on the hospital readmissions data that you optimized a hyperparameter, \(\lambda\), to get the best-tuned model. It is the same idea here with mtry. Remember, optimized results will be the lowest error as measured by Root Mean-Squared Error (\(RMSE\)) and the highest percent of the variance in the target explained, \(R^2\). In other words, you want a model that uses as much information as it can, which results in lower error, and betters your chances that it will return a good prediction!
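As a quick refresher on what that error metric actually computes (hypothetical numbers, base R only): \(RMSE\) is the square root of the mean squared difference between observed and predicted values, so smaller is better.

```r
## Manual RMSE on made-up numbers; caret's RMSE() computes the same thing
obs  <- c(1.0, 2.0, 3.0)
pred <- c(1.1, 1.8, 3.3)
sqrt(mean((obs - pred)^2))
## [1] 0.2160247
```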

Question 19C [0.5 points]

Which value of mtry gave the best results? (Hint: Can \(R^2\) values be negative?!?). Are you able to compare the \(RMSE\) of this model to the one you fitted in Question 16C? Why or why not? (Hint: Does your pre-processing step apply to the target too? You may have to do a little research!)

Your answer here.

4.2 Fitting a 10-fold cross-validated random forest (RF) model

You now decide to move forward with the 10-fold cross-validation. Remember that this is to help control for overfitting, which you can identify by looking at the error (\(RMSE\)) and fit (\(R^2\)) in each of the folds.

Define the training control method as 10-fold cross-validation (CV) and run the model

## Define training control for 10-fold cross-validation
ctrlCV <- trainControl(method = "cv", number = 10)

## Run the 10-fold CV random forest, including preprocessing instructions
rfCV <- train(
  ## Define the model
  monthlyPercMortality ~ Year + Month + lagged_monthlyPercMortality,
  ## Set the data
  data = mortalityTrain,
  ## Indicate that it is a random forest
  method = "rf",
  ## Indicate the training control method to use, defined above
  trControl = ctrlCV,
  ## Indicate that you want to extract importance at the end
  importance = TRUE,
  ## Indicate the tuning grid, defined explicitly earlier
  tuneGrid = tuneGrid
)

Look at the results (\(RMSE\) and \(R^2\)) of the 10-fold CV random forest

rfCV$results %>% 
  kable(
    format = "html",
    caption = "Table 4. Results of the 10-fold CV random forest on California Mortality (All Causes).") %>%
    kable_styling(bootstrap_options = c("hover"))
Table 4. Results of the 10-fold CV random forest on California Mortality (All Causes).
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.9824039 0.0362380 0.7065786 0.0248160 0.0161623 0.0199071
2 1.0378065 0.0187465 0.7708147 0.0344905 0.0132851 0.0287346
3 1.0647706 0.0172585 0.7970073 0.0397073 0.0120962 0.0352672

Question 19D [0.5 points]

What do you notice about the effects of cross-validation on your model? (Hint: Do you see any negative \(R^2\) values this time?) Which value of mtry gave the best results? How does the \(RMSE\) of that model compare to the one you chose as the optimal OOB model?

Your answer here.

4.3 Test the training model with the 2020 data

4.3.1 Preprocess your test data with the same centering/scaling

Since you used caret’s preProcess() during model training, you also have to apply the same transformation to your testing data before making predictions!

Notice that you already stored the pre-processing instructions for the training set in preProcInstructions; now you apply that same pre-processing to the testing set using the predict() function. This is separate from the predictions you will make on the testing set using the model!

## Apply the same preprocessing as training using predict(), 
## using the test data
mortalityTest <- predict(preProcInstructions, newdata = mortalityTest)

4.3.2 Predict using the best RF model

An advantage of using caret is that it will automatically identify your best model for you, which you can access by simply using the model object rfCV with predict(). Make sure you think about why you are using rfCV and not rfOOB here!

## Predict on pre-processed test data using the best trained RF model
rfPredictions <- predict(rfCV, newdata = mortalityTest)

4.3.3 Evaluate performance with \(RMSE\) and \(MAE\)

caret also has built-in functions to calculate key performance metrics, like \(RMSE\) and \(MAE\) for regression models or a confusion matrix with accuracy for classification models. You can read more about them in this chapter.

### Compute RMSE (Root Mean Squared Error)
rfRMSE <- RMSE(pred = rfPredictions, obs = mortalityTest$monthlyPercMortality)

## Compute MAE (Mean Absolute Error) as well, optionally
rfMAE <- MAE(pred = rfPredictions, obs = mortalityTest$monthlyPercMortality)

## Print results
print(paste0("Random Forest Test RMSE:   ", round(rfRMSE, 4)))
## [1] "Random Forest Test RMSE:   1.0062"
print(paste0("Random Forest Test MAE:    ", round(rfMAE, 4)))
## [1] "Random Forest Test MAE:    0.7402"

Question 19E [0.5 points]

Can you interpret the \(RMSE\) and \(MAE\) values? Remember all of the data were subject to pre-processing, both target and predictors!

How did the training model perform? Does it seem to be overfitting (training error >> testing error), underfitting (training error and testing error both high but roughly equal with a low \(R^2\)), or doing okay?

Your answer here.

4.3.4 Plot Actual vs. Predicted Values

Unimpressed with the results so far, you decide to plot the actual vs. predicted values, because that can tell you whether your training model is likely leading to an under-prediction or an over-prediction relative to the true values of mortality. Using the ggExtra package, you add histograms to the margins of the scatterplot as well, to help you interpret correctly.

## Make a scatterplot of the actual vs. predicted values
p <- ggplot(mortalityTest, aes(x = monthlyPercMortality, y = rfPredictions)) +
  ## Scatterplot with geom_point
  geom_point(alpha = 0.5, color = skittles[4]) +
  ## Add a line to help see the general trend
  geom_abline(color = skittles[3], linetype = "dashed", size = 1.5) +
  theme_minimal() +
  labs(
    title = "Actual vs. Predicted California Monthly Percent Mortality",
    subtitle = "Random Forest Predictions on Jan - Aug 2020",
    x = "Actual Mortality",
    y = "Predicted Mortality"
  )
## Add marginal histograms with the ggExtra package
ggMarginal(p, type = "histogram", fill = skittles[2], color = "gray30")

Question 19F [0.5 points]

Interpret the scatterplot of the actual vs. predicted values. What is happening to the predicted values? (Use the histogram on the side to help you, that’s why they’re there!) Make sure to mention whether the training model is leading to an under prediction or an over prediction of mortality. What real information might that provide your stakeholders, even if the model is underperforming?

Your answer here.

4.4 Extract variable importance from the final model

You also decide to extract the variable importance from the final trained model to share with your stakeholders.

Table 5. Variable importance from the 10-fold CV random forest on California Mortality (All Causes).
Overall
Year 0.00
Month 100.00
lagged_monthlyPercMortality 90.32

4.5 Comparing your predictions to the real deaths data

Now, let’s pretend that you’re skipping ahead a few months in time and can actually assess just how well your prediction did in reality. You could fit a 10-fold, cross-validated random forest on all of the data, download the remaining months of 2020, and use all of 2020 as your test dataset; instead, I’m just going to ask you to run a simple correlation to see how badly - or well - it did in reality.

Question 19G [5 points]

Since the pandemic is over in reality, you can find the actual number of deaths, per county, here.

Hint: You can read this file in as a dataframe using read.csv() as follows! I also show you the structure (str()) of the data so you have an idea what you’re getting.

caliDeaths <- read.csv("https://s3.amazonaws.com/og-production-open-data-chelseama-892364687672/resources/078185a9-e3a7-403f-a546-b8c582f0a9d8/20241220_deaths_final_2019-2023_occurrence_county_month_sup.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJIENTAPKHZMIPXQ%2F20250615%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250615T190154Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=61c4009b5979f2d74a95396e2f77ace3d13315fe166588cb1210ef46d292f759")

caliDeaths %>% 
  str()
## 'data.frame':    633360 obs. of  13 variables:
##  $ Year              : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ Month             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ County            : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ Geography_Type    : chr  "Occurrence" "Occurrence" "Occurrence" "Occurrence" ...
##  $ Strata            : chr  "Total Population" "Age" "Age" "Age" ...
##  $ Strata_Name       : chr  "Total Population" "Under 1 year" "1-4 years" "5-14 years" ...
##  $ Cause             : chr  "ALL" "ALL" "ALL" "ALL" ...
##  $ Cause_Desc        : chr  "All causes (total)" "All causes (total)" "All causes (total)" "All causes (total)" ...
##  $ ICD_Revision      : chr  "ICD-10" "ICD-10" "ICD-10" "ICD-10" ...
##  $ Count             : int  942 NA NA NA NA 15 17 45 121 175 ...
##  $ Annotation_Code   : int  NA 1 1 1 1 NA NA NA NA NA ...
##  $ Annotation_Desc   : chr  "" "Cell suppressed for small numbers" "Cell suppressed for small numbers" "Cell suppressed for small numbers" ...
##  $ Data_Revision_Date: chr  "12/20/2024" "12/20/2024" "12/20/2024" "12/20/2024" ...

Remember to filter for the proper condition, Year, and Months to get the data you need. Take time to explore the data, and make sure that you are matching the data we’ve been using (total mortality, all causes of death) to your filter.

## Your code here

Now make sure to compute the monthly percent mortality as you have done before. Otherwise, you won’t be comparing apples-to-apples!

## Your code here

Lastly, create (1) a scatterplot with trend line and (2) perform a correlation test on your Sept to December 2020 predictions relative to the true Sept to December 2020 monthly percent mortality.

## Your code here

For full credit include a brief interpretation of your findings.

Question 19H [1 point]

By now, it should be clear that there is a monthly seasonality to deaths that has nothing to do with what year it is, even though 2020 was a pandemic year. In fact, that seasonality is so strong that it ends up as a very strong predictor - about as strong as the spatial autocorrelation! Although I wish I could dive into that with you deeply now, I would not be able to do the topic justice. Note that Merrimack offers an entire course on time series - but if you are unable to take the course but want to learn more on your own, one resource I can recommend is A Little Book of R For Time Series by Avril Coghlan.

Since you can’t go deeply into it here, you will instead do a little reading/research. I suggest reading about temporal autocorrelation or seasonal momentum and how adding time-series lags can help. How might time series lags capture what the “Month” variable alone may not fully capture, in a similar way to how “County” alone may not fully capture spatial relationships?

Note: Doing the time-aware data partition into before 2020 and after 2020 IS a form of controlling for temporal patterns/seasonality, albeit not as robust as including temporally lagged variables.

Your answer here.

Question 19I [1 point]

Now, let’s say you add several time-series lagged variables to your random forest; with 10-fold CV, the testing MAE is 0.127. The scale of the target variable is unchanged - recall that previously, you got a testing MAE of 0.740. Does this suggest that the time series lagged variables are helping or hurting the model? If I tell you that the training MAE is 0.131, do you think the model is overfitted, underfitted, or doing pretty well?

Your answer here.

Would you like to see how I got this? This source code contains all of the steps of analysis, with comments, and will show you these same results! Make sure to open the source code if you want to see the step-by-step instructions and not just the output! In brief, what I do is engineer temporal lag and rolling means features to inform the model how the target behaved in the past. Lags are explicitly accounting for temporal autocorrelation; rolling averages show recent trends and smooth noise. Then, I do the time-aware split again using the 2020 data as my holdout testing set. I then add the lagged and rolling means features, along with the spatial lag, month, and year variables as predictors to the RF model and assess performance on the testing holdout set. Note that RF is a particularly good choice for temporal models because it is non-linear; further, adding lags makes it partially time-aware. However, don’t forget to ALWAYS validate with time held out, not random sampling, in temporal models!

source("mortalityTemporalLagRF.R")
## [1] "These are the results of a temporally lagged 10-CV random forest."
## [1] "These are the training performance metrics:"
##   mtry      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 1    2 0.3631824 0.8684731 0.2016477 0.05424254 0.03044163 0.02170595
## 2    5 0.2876949 0.9179475 0.1560409 0.04812464 0.02206279 0.01703057
## 3    8 0.2412530 0.9424415 0.1309098 0.04122332 0.01668333 0.01497029
## [1] "These are the testing performance metrics using the 2020 holdout set:"
## [1] "Random Forest Test RMSE:   0.2227"
## [1] "Random Forest Test MAE:    0.1269"
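For reference, the lag and rolling-mean feature engineering described above can be sketched roughly as below, using `dplyr::lag()` and `zoo::rollmeanr()` on a fabricated toy series. This is an illustration of the technique under assumed column names, not the exact code in `mortalityTemporalLagRF.R`.

```r
library(dplyr)
library(zoo)

## Fabricated toy series standing in for one county's monthly % mortality
toy <- data.frame(
  County  = "Fresno",
  Date    = seq(as.Date("2019-01-01"), by = "month", length.out = 8),
  pctMort = c(8.5, 8.1, 7.8, 7.4, 7.0, 7.2, 7.5, 7.9)
)

toy <- toy %>%
  group_by(County) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    lag1  = lag(pctMort, 1),                      ## value 1 month back
    lag12 = lag(pctMort, 12),                     ## same month last year
                                                  ## (all NA in this short toy)
    roll3 = rollmeanr(pctMort, k = 3, fill = NA)  ## trailing 3-month average
  ) %>%
  ungroup()
```

The lag columns explicitly encode temporal autocorrelation, while the trailing rolling mean smooths noise and captures the recent trend; all three would then be passed to the random forest as ordinary predictors.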

Question 19J [1 point]

What is something else you can do to improve your model that you have not discussed or explored at all here? Hint: Random forest models need complexity!

Your answer here.

Question 19K [3 points]

Take all of the findings and interpretations from Questions 19A-J and write up a short one- or two-paragraph summary for your stakeholders. Your writing should not be overly technical, but it should include caveats and limitations. Make sure to include 1-2 sentences of recommendations about what you would do next.

Your answer here.

5 Zooming in on Mortality Causes

California was one of the strictest states when it came to shutdown measures during the COVID-19 pandemic, with statewide enforcement and a duration longer than in many other states. California declared a state of emergency on March 4, 2020, with a mandatory statewide stay-at-home order issued on March 19, 2020. The statewide order did not end until January 25, 2021, and it was not until April 6, 2021 that the state announced plans to fully reopen the economy by June 15, 2021.

Your stakeholder is actually not as interested in deaths from COVID-19 itself, but rather in how the mandatory stay-at-home order seems to be affecting mortality due to certain other conditions. For example, are people less likely to die from accident or injury while under stay-at-home orders?

5.1 Death from Accident & Injury

Because of the mandatory COVID-19 shutdown in California, which has led to generally reduced movement throughout California, your stakeholder would like to know if the number of accidental deaths declined enough during COVID-19 shutdown to generate a signal.

HYPOTHESIS: You hypothesize that because of intense, mandatory stay-at-home measures, deaths attributed to accidents and injuries (INJ) declined during the COVID-19 shutdown during the months of 2020 in California.

You will start, as with all good analysis plans, with some more Exploratory Data Analysis.

5.1.1 Sex-specific Effects on INJ Mortality

5.1.1.1 Total Deaths by Sex

Your stakeholder asks you to be sure to stratify by the sex of the decedent, because they suspect there is a difference in death rates by the decedent’s biological sex.

The leading causes of death are located in the sexDeaths table because that was how the data were provided to you. They are separated by sex, thus you can check to see if there are sex differences that could matter for your analysis. You need to join some of your tables and calculate some additional metrics as well, including:

  • Proportion of deaths from accident or injury out of all deaths (no longer adjusting for population size)

  • Ratio of adult females to males in the population

  • Ratio of females to males in the population among those over 65 years of age

  • Ratio of all adults over the age of 65 (retirement age) to those under retirement age

You have opted to adjust per total deaths, rather than population size, so that you can better examine the fraction of deaths due to particular causes regardless of the underlying size of the population. However, you need to account for the ratio of females to males in the population, as it might be useful now or in future analyses to have access to that ratio. Similarly, having the ratio of retirees (those over the age of 65 years) to non-retirees in the population may also be useful in future analyses. Further, the size of a county may still matter, so you will likely want to retain the total sample size of the county as a control variable as well. Note that, in a more detailed analysis, population density would be better than population size, but this is the proxy of county size you currently have to work with!

## Join the sex deaths table with the census data; then filter for INJ
injDeaths <- sexDeaths %>% 
  full_join(census, by = "County") %>% 
  ## Here is where you filter for a specific cause of death relative to sex
  filter(leadingCauseDeath == "INJ") %>% 
  ## Now full join with the first 4 columns of the totalDeaths table to get 
  ## total deaths also
  full_join(totalDeaths[ , 1:4], by = c("County", "Year", "Month")) %>%
  ## Rename the sexDeaths column to something more appropriate
  rename(injuryDeaths = sexDeaths) %>% 
  ## Calculate the proportion of all deaths that are from injury 
  ## RELATIVE TO ALL DEATHS (not population!!)
  mutate(deathsInjuriesProportion = injuryDeaths/totalDeaths,
         ## The ratio of females to males in the population
         femaleMaleRatio = femalesN/malesN,   
         ## The ratio of 65+ females to males in the population
         femaleMaleRatio65 = females65plus/males65plus,  
         ## Ratio of retirees to non-retirees
         retireeRatio = over65/totalN,  
         ## Convert the date to a date format for plotting
         Date = as.Date(paste(Year, Month, "1", sep = "-"), 
                        format = "%Y-%m-%d")) %>%    
  ## Division by zero yields NaN; turn those back into 0
  mutate(deathsInjuriesProportion = ifelse(is.nan(deathsInjuriesProportion), 
                                           0, deathsInjuriesProportion))  

Question 20A [0.5 points]:

Would it be fair to model the proportion of deaths due to injury or accident without accounting for an effect of sex? Why or why not?

Your answer here.

5.1.1.2 Proportion of Deaths by Sex

What if you calculate the proportion of deaths, for each sex, instead? Wouldn’t that give you a more accurate portrait of any sex-level differences that might exist?

Question 20B [0.5 points]:

What do you notice about the trend? Is there anything compelling here? Can you find any possible explanation for what happened in May (month 5) of 2020?

Your answer here.

5.1.2 Mapping injury-related deaths by stay-at-home orders & sex

So, hopefully you’ve noticed what I’ve noticed about the relationship between the stay-at-home orders and the proportion of overall deaths due to accident or injury, as well as the effect of the shutdown for each decedent biological sex. You know from your earlier graph that females have a lower proportion of overall deaths due to accident or injury; next, it would be nice to see the spatiotemporal relationship (before and after shutdown) with average accident/injury mortality overall, as well as for each of the sexes separately. This will allow you to see whether there is a geographic pattern to the effect, as one might expect in denser areas with more roadways.

Let’s add a label to the dataset to identify whether a month was under stay-at-home orders or not, and call the new feature stayAtHome. Notice that I could have coded it as a dummy variable (0 or 1) where 1 is the stay-at-home order condition, but instead I chose to encode it as a category. I can also change the encoding later, if needed. Note also that although I chose to use Date as the condition for my ifelse(), you could just as easily have chosen a joint conditional using both the Month and Year columns!

## Encode stayAtHome as any date after March 19, 2020
injDeaths <- injDeaths %>% 
  mutate(stayAtHome = ifelse(Date >= "2020-03-19", "Stay-at-Home", "None"))

Next, let’s calculate a summarized dataframe that groups by stayAtHome, Year, County, and biologicalSex, then summarizes an average deathsInjuriesProportion for use with the maps. Notice that rather than storing the summarized dataframe separately, I simply piped it through to join it with the geometry from the shapefile for map-making this time. I was feeling a bit lazy.

##### Calculate the proportion of all deaths due to injury
injuryDeathSummary <- injDeaths %>% 
  ## Group by stay-at-home order, year, county, AND sex
  group_by(stayAtHome, Year, County, biologicalSex) %>% 
  ## Change the name to make the join easier
  rename(COUNTY_NAM = County) %>% 
  ## Mean proportion of deaths, averaged across months, for each
  ## stay-at-home/year/county/sex combo
  summarise(deathsInjuriesPercent = mean(deathsInjuriesProportion, na.rm = TRUE)*100) %>% 
  ## Join to get the geometries for mapping
  full_join(cali_sf, by = "COUNTY_NAM", relationship = "many-to-many") %>% 
  ## Coerce to make sure it is seen as a shapefile
  st_as_sf() %>% 
  ## Remove any of the island geometries as you did in the first part
  filter(is.na(ISLAND)) %>% 
  ungroup() 

#### Find the five number summary
summ <- fivenum(injuryDeathSummary$deathsInjuriesPercent)
paste0("The five-number summary of % of deaths that are accident/injury-related is: ")
## [1] "The five-number summary of % of deaths that are accident/injury-related is: "
paste(c("Min", "Lower Quartile", "Median", "Upper Quartile", "Max"), 
      round(summ, 3))
## [1] "Min 0"                "Lower Quartile 0"     "Median 0"            
## [4] "Upper Quartile 3.343" "Max 18.489"
## Notice that, with the minimum, 1st quartile, AND median all equal to 0, we
## really only have three groups

##### Now add on the cuts injuryDeathSummary labels for mapping:
injuryDeathSummary <- injuryDeathSummary %>% 
  mutate(deathLabels = cut(deathsInjuriesPercent, 
                          breaks = c(0, 3.343, 18.489, Inf),
                          labels = c("0-3.342", "3.343-18.489", "18.489+"),
                          ordered_result = TRUE, right = F)) 

Question 21A [0.5 points]:

Do you think there is any evidence that the percent of deaths due to accident & injury changed as a result of the stay-at-home orders? What would provide stronger evidence of a change in percent of deaths due to accident and injury as a result of the shutdown? Do these maps include any evidence of interaction with the sex of the decedent?

Hint: Make sure to explain what you would want to do to these maps, either in terms of filtering and/or additional faceting, to better address whether (1) the shutdown has had a likely effect and (2) whether there is an interaction with biological sex on the proportion of deaths due to accident or injury.

Your answer here.

Question 21B [0.5 points]:

Why are some counties filled gray? Are the data truly missing or is it a data maturity problem? Or, is there another reason? (Hint: What will happen if the number of deaths in a county are very small?)

Your answer here.

Question 21C [3 points]:

Address one of the two situations you identified in Question 21A. Either make more compelling maps before and after the shutdown OR show the interaction with biological sex; your choice.

## Your code here.
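If you choose the interaction route, one possible shape for the code is a faceted `geom_sf()` map. The sketch below uses a tiny fabricated `sf` object (`toyMap`, two fake square “counties” with made-up percentages) so that it runs standalone; with the real data you would plot `injuryDeathSummary` instead.

```r
library(sf)
library(ggplot2)

## Two fake square "counties", crossed with stay-at-home condition and sex
sq <- function(x0) st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0),
                                         c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
df <- expand.grid(COUNTY_NAM    = c("A", "B"),
                  stayAtHome    = c("None", "Stay-at-Home"),
                  biologicalSex = c("Female", "Male"))
df$deathsInjuriesPercent <- c(4, 9, 2, 6, 3, 8, 1, 5)  ## fabricated values
toyMap <- st_sf(df, geometry = st_sfc(rep(list(sq(0), sq(1.2)), 4)))

## Facet rows by sex and columns by stay-at-home condition, so the
## shutdown-by-sex interaction is visible at a glance
p <- ggplot(toyMap) +
  geom_sf(aes(fill = deathsInjuriesPercent)) +
  facet_grid(biologicalSex ~ stayAtHome) +
  scale_fill_viridis_c(name = "% deaths (INJ)") +
  theme_minimal()
```

With `injuryDeathSummary`, the same `facet_grid(biologicalSex ~ stayAtHome)` call would produce one California map per sex-by-shutdown combination.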

5.1.3 Controlling for population-level differences in sex ratio

Even if you didn’t explore biological sex on the map, it seems that there are likely some sex-specific effects in percent of deaths due to accident and injury. But what if some counties have a skewed sex ratio? You might need to adjust for that! Although you might expect, a priori, for all counties to have a female to male ratio of 1:1, in reality that isn’t always the case.

In order to determine if you need to control for any underlying population-level differences in the sex ratio of each county, you make a simple histogram and plot a 1:1 line where the proportion of females in the population will equal males. You do this for each county. You make a blue, solid line for the mean sex-ratio, and a black, dashed line to show the reference 1:1 line.

Note that this ratio is coming from the Census data that you imported earlier; you calculated this sex ratio in an earlier code chunk from those data. Thus, these data are from 2020.
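For reference, the histogram described above can be sketched as follows. The `ratios` values here are fabricated stand-ins; with the real data you would take one `femaleMaleRatio` per county from `injDeaths` (e.g., via `distinct(injDeaths, County, femaleMaleRatio)`).

```r
library(ggplot2)

## Fabricated stand-in: one female:male ratio per hypothetical county
ratios <- data.frame(femaleMaleRatio = c(0.72, 0.85, 0.93, 0.97, 0.99,
                                         1.00, 1.01, 1.02, 1.03, 1.04))

p <- ggplot(ratios, aes(x = femaleMaleRatio)) +
  geom_histogram(bins = 10) +
  ## Blue, solid line at the mean of the observed ratios
  geom_vline(xintercept = mean(ratios$femaleMaleRatio), colour = "blue") +
  ## Black, dashed reference line at the 1:1 ratio
  geom_vline(xintercept = 1, colour = "black", linetype = "dashed") +
  labs(x = "Female:Male ratio in the population", y = "Number of counties")
```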

You can see that you have a strongly left-skewed distribution, with the mean of the female:male ratio dragged to the left. What does this mean, exactly? You have more counties where males outnumber females than the reverse! And since males seem to have a higher proportion of deaths by accident/injury than females, you need to decide whether you need to account for this in your maps. Accounting for it in the model will be easy: you just include the female:male ratio as a predictor variable. In the maps, however, it is actually a little trickier and would require a mathematical correction that could be difficult for your stakeholders to understand. Is it even necessary, then?

Question 22 [1 point]:

Calculate the correlation of femaleMaleRatio and the deathsInjuriesProportion, making sure to be careful to drop or exclude any NAs before making your calculation.

## Your code here.
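A hedged sketch of how the NA handling might look; the fabricated `injToy` frame stands in for the real `injDeaths` columns here.

```r
## Fabricated stand-in for the relevant columns of injDeaths; note the NAs
injToy <- data.frame(
  femaleMaleRatio          = c(0.95, 1.01, NA,   0.98, 1.03),
  deathsInjuriesProportion = c(0.04, 0.03, 0.05, NA,   0.02)
)

## use = "complete.obs" drops any row with an NA in either column
cor(injToy$femaleMaleRatio, injToy$deathsInjuriesProportion,
    use = "complete.obs")

## cor.test also drops incomplete pairs, and gives a p-value as well
cor.test(injToy$femaleMaleRatio, injToy$deathsInjuriesProportion)
```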

Do you think that this correlation coefficient indicates that you need to make a correction on your maps, or is it likely sufficient just to make the correction in your models? In other words, large correlation coefficients (in the absolute value sense) would indicate a need to correct the maps; smaller coefficients would indicate that it’s probably not a big deal for your visualizations provided to stakeholders.

Your answer here.

Question 23 [1 point]:

Now, take a look at the scatter plot below with a best-fit line. Does the plot agree with the correlation coefficient (i.e., direction of slope) you obtained? Why or why not?

Your answer here.

Question 24 [3 points]:

Originally, I was going to walk you through predictive modeling here as well; but given the need to abbreviate this project, I am instead going to ask you simply to construct a data science question and hypothesis given the information you have so far. Are there any variables you pulled from the Census that would be worth considering here as well?

Choose what kind of model you would do. Would you include spatial lags? Ideally, would you include temporal lags?

Your answers here.

5.2 Choose your own cause-of-death to explore.

Question 25 [15 points]:

Now for the choose your own adventure portion of this assignment! Table 2 gives you a list of ICD-10 codes for the causes of death available in the dataset. You should be able to use the code in Section 5.1, with minimal adjustment, to explore your data.

Feel free to get creative! You are not required to continue stratifying by sex, unless you want to; in the Census data you pulled, you also have stratifications by retirement age status (over 65 years) or double-stratified by both sex and retirement age. You may also choose not to use any stratification if you wish.

For full credit, I will be looking for:
  1. Annual choropleth maps showing your cause-of-death, like this one

  2. A line graph showing changes, by month and year, in your cause-of-death, somewhat like this one, with stratification optional

  3. At least one other form of exploration to form a testable question or hypothesis, including but not limited to:

  • A correlation test
  • A scatterplot
  • Stratified boxplots
  • Feel free to be creative!
  4. Interpretation of all graphs or analyses explored.

  5. A data science question you would like to test

  6. A data science hypothesis based on what you found [Note: if you find nothing compelling, you may instead defend with evidence why there is no point in developing a hypothesis.]